Spectral Clustering and Block Models: A Review and a New Algorithm


Sharmodeep Bhattacharyya and Peter J. Bickel 


Abstract We focus on spectral clustering of unlabeled graphs and review some results on clustering methods which achieve weak or strong consistent identification in data generated by block models. We also present a new algorithm which appears to perform optimally both theoretically, using asymptotic theory, and empirically.


1 Introduction 

Since its introduction in [15], spectral analysis of various matrices associated with graphs has become one of the most widely used clustering techniques in statistics and machine learning.

In the context of unlabeled graphs, a number of methods, all of which come under the broad heading of spectral clustering, have been proposed. These methods, based on spectral analysis of adjacency matrices or some derived matrix such as one of the Laplacians ([31], [28], [23], [29], [32]), have been studied in connection with their effectiveness in identifying members of blocks in exchangeable graph block models. In this paper, after introducing the methods and models, we intend to review some of the literature. We relate it to the results of Mossel, Neeman and


Sharmodeep Bhattacharyya 

Oregon State University, Department of Statistics, 44 Kidder Hall, Corvallis, OR, e-mail: bhattash@science.oregonstate.edu

Peter J. Bickel 

University of California at Berkeley, Department of Statistics, 367 Evans Hall, Berkeley, CA, e-mail: bickel@stat.berkeley.edu




Sly (2012) [26] and Massoulie (2014) [24], where it is shown that for very sparse models there exists a phase transition below which members cannot be identified better than chance, and that above the phase transition one can do better using rather subtle methods. In [6] we develop a spectral clustering method based on the matrix of geodesic distances between nodes which can achieve the goals of the work we cited and in fact behaves well for all unlabeled networks: sparse, semi-sparse and dense. We give a statement and sketch the proof of these claims in [] but give a full argument, for the sparse case considered by the above authors only, in this paper. We give the necessary preliminaries in Section 2, more history in Section 3 and show the theoretical properties of the method in Section 4.


2 Preliminaries 

There are many standard methods of clustering based on numerical similarity matrices which are discussed in a number of monographs (e.g., Hartigan [19], Leroy and Rousseeuw [30]). We shall not discuss these further. Our focus is on unlabeled graphs of n vertices characterized by adjacency matrices $A = \|a_{ij}\|$ for n data points, with $a_{ij} = 1$ if there is an edge between i and j and $a_{ij} = 0$ otherwise. The natural assumption then is $A = A^T$. Our basic goal is to divide the points into K sets such that, on some average criterion, the points in a given subset are more similar to each other than to those of other subsets. Our focus is on methods of clustering based on the spectrum (eigenvalues and eigenvectors) of A or related matrices.


2.1 Notation and Formal Definition of Stochastic Block Model 

Definition 1. A graph $G_n(P, \pi)$ generated from the stochastic block model (SBM) with K blocks and parameters $P \in (0,1)^{K \times K}$ and $\pi \in (0,1)^K$ can be defined in the following way: each vertex of the graph $G_n$ is assigned to a community $c \in \{1, \dots, K\}$. The $(c_1, \dots, c_n)$ are independent outcomes of multinomial draws with parameter $\pi = (\pi_1, \dots, \pi_K)$, where $\pi_i > 0$ for all i. Conditional on the label vector $c \equiv (c_1, \dots, c_n)$, the edge variables $A_{ij}$ for $i < j$ are independent Bernoulli variables with

$$E[A_{ij} \mid c] = P_{c_i c_j} = \min\{\rho_n B_{c_i c_j}, 1\}, \qquad (1)$$
where $P = [P_{ab}]$ and $B = [B_{ab}]$ are $K \times K$ symmetric matrices. We call P the connection probability matrix and B the kernel matrix for the connection. So, we have $P_{ab} \le 1$ for all $a, b = 1, \dots, K$, $P\mathbf{1} \le \mathbf{1}$ and $\mathbf{1}^T P \le \mathbf{1}$ element-wise.

By definition $A_{ji} = A_{ij}$, and $A_{ii} = 0$ (no self-loops).

This formulation is a reparametrization due to Bickel and Chen (2009) [8] of the definition of Holland and Leinhardt [20]. It permits separate consideration asymptotically of the density of the graph and its structure as follows:

P(Vertex 1 belongs to block a and vertex 2 to block b and are connected) $= \pi_a \pi_b P_{ab}$

with $P_{ab}$ depending on n, $P_{ab} = \rho_n \min(B_{ab}, 1/\rho_n)$. We can interpret $\rho_n$ as the unconditional probability of an edge and $B_{ab}$ essentially as

P(Vertex 1 belongs to a and vertex 2 belongs to b | an edge between 1 and 2).


Set $\Pi = \mathrm{diag}(\pi_1, \dots, \pi_K)$.

1. Define the matrices $M = \Pi B$ and $S = \Pi^{1/2} B \Pi^{1/2}$.

2. Note that the eigenvalues of M are the same as those of the symmetric matrix S and in particular are real-valued.

3. The eigenvalues of the expected adjacency matrix $\bar A \equiv E(A)$ are also the same as those of S but with multiplicities. We denote the eigenvalues by their absolute order, $\lambda_1 \ge |\lambda_2| \ge \dots \ge |\lambda_K|$.

Let us denote by $(\varphi_1, \dots, \varphi_K)$, $\varphi_i \in \mathbb{R}^K$, the eigenvectors of S corresponding to the eigenvalues $\lambda_1, \dots, \lambda_K$. If a set of $\lambda_j$'s are equal to $\lambda$, we choose eigenvectors from the eigenspace corresponding to $\lambda$ as appropriate. Then we have $\phi_i = \Pi^{-1/2} \varphi_i$ and $\psi_i = \Pi^{1/2} \varphi_i$ as the left and right eigenvectors of M. Also, $\langle \phi_i, \phi_j \rangle_\pi \equiv \sum_{k=1}^K \pi_k \phi_{ik} \phi_{jk} = \delta_{ij}$. The spectral decompositions of B, S and M are

$$B = \sum_{k=1}^K \lambda_k \phi_k \phi_k^T, \qquad S = \sum_{k=1}^K \lambda_k \varphi_k \varphi_k^T, \qquad M = \sum_{k=1}^K \lambda_k \psi_k \phi_k^T.$$
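As a quick numerical sanity check of these relations, the following sketch builds $M = \Pi B$ and $S = \Pi^{1/2} B \Pi^{1/2}$ for a small example and verifies that they share eigenvalues, and that $\Pi^{-1/2}\varphi_k$ and $\Pi^{1/2}\varphi_k$ behave as left and right eigenvectors of M. The values of `pi` and `B` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical example values (not taken from the paper).
pi = np.array([0.5, 0.3, 0.2])            # block proportions, sum to 1
B = np.array([[3.0, 1.0, 0.5],
              [1.0, 2.5, 0.8],
              [0.5, 0.8, 2.0]])           # symmetric kernel matrix

Pi = np.diag(pi)
M = Pi @ B                                            # not symmetric in general
S = np.diag(np.sqrt(pi)) @ B @ np.diag(np.sqrt(pi))   # symmetric

lam, varphi = np.linalg.eigh(S)           # eigenpairs of S (eigenvectors as columns)
phi = np.diag(pi ** -0.5) @ varphi        # candidate left eigenvectors of M
psi = np.diag(pi ** 0.5) @ varphi         # candidate right eigenvectors of M

same_spectrum = np.allclose(np.sort(np.linalg.eigvals(M).real), lam)
right_ok = np.allclose(M @ psi, psi * lam)           # M psi_k = lam_k psi_k
left_ok = np.allclose(phi.T @ M, (phi * lam).T)      # phi_k^T M = lam_k phi_k^T
orthonormal_ok = np.allclose(phi.T @ Pi @ phi, np.eye(3))  # pi-weighted orthonormality
B_ok = np.allclose(B, (phi * lam) @ phi.T)           # B = sum_k lam_k phi_k phi_k^T
M_ok = np.allclose(M, (psi * lam) @ phi.T)           # M = sum_k lam_k psi_k phi_k^T
```

All six checks hold for any positive `pi` and symmetric `B`, since they follow from $S\varphi_k = \lambda_k \varphi_k$ alone.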


2.2 Spectral Clustering 

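The basic goal of community detection is to infer the node labels c from the data. Although we do not explicitly consider parameter estimation, the parameters can be recovered from $\hat c$, an estimate of $(c_1, \dots, c_n)$, by

$$\hat\pi_a = \frac{1}{n} \sum_{i=1}^n 1(\hat c_i = a), \quad 1 \le a \le K, \qquad \hat P_{ab} = \frac{1}{O_{ab}} \sum_{i \ne j} A_{ij}\, 1(\hat c_i = a, \hat c_j = b), \quad 1 \le a, b \le K, \qquad (2)$$

where

$$O_{ab} = \begin{cases} n_a n_b, & 1 \le a, b \le K,\ a \ne b \\ n_a (n_a - 1), & 1 \le a \le K,\ a = b \end{cases}, \qquad n_a = \sum_{i=1}^n 1(\hat c_i = a), \quad 1 \le a \le K.$$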
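A minimal sketch of such plug-in estimates is given below, assuming the normalization that counts ordered vertex pairs: block proportions are label frequencies, and each connection probability is the number of observed edges between two blocks divided by the number of possible ones. The function name `plug_in_estimates` is hypothetical.

```python
import numpy as np

def plug_in_estimates(A, labels, K):
    """Estimate block proportions and connection probabilities from labels.

    A      : (n, n) symmetric 0/1 adjacency matrix with zero diagonal
    labels : length-n sequence of estimated block labels in {0, ..., K-1}
    """
    labels = np.asarray(labels, dtype=int)
    n = A.shape[0]
    Z = np.eye(K)[labels]                 # n x K one-hot membership matrix
    counts = Z.sum(axis=0)                # n_a, number of vertices per block
    pi_hat = counts / n
    edges = Z.T @ A @ Z                   # edge counts over ordered pairs (i, j)
    # Ordered pairs available: n_a * n_b off the diagonal, n_a * (n_a - 1) on it.
    pairs = np.outer(counts, counts) - np.diag(counts)
    P_hat = np.divide(edges, pairs, out=np.zeros((K, K)), where=pairs > 0)
    return pi_hat, P_hat
```

For two disjoint triangles labelled by component, this returns equal proportions and an identity connection matrix.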


There are a number of approaches for community detection based on modularities ([18], [8]), maximum likelihood and variational likelihood ([11], [7]), and approximations such as semidefinite programming approaches [3] and pseudolikelihood [2], but these all tend to be computationally intensive and/or require good initial assignments of blocks. The methods which have proved both computationally effective and asymptotically correct, in a sense we shall discuss, are related to spectral analysis of the adjacency or related matrices. They differ in important details.

Given an $n \times n$ symmetric matrix M based on A, the algorithms are of the form:


1. Use the spectral decomposition of M or a related generalized eigenproblem.

2. Obtain an $n \times K$ matrix of K $n \times 1$ vectors.

3. Apply K-means clustering to the n K-dimensional row vectors of the matrix of Step 2.

4. Identify the indices of the rows belonging to cluster j, $j = 1, \dots, K$, with vertices belonging to block j.
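The four steps can be sketched as follows. This is only an illustration under stated assumptions: the K eigenvectors are those with largest absolute eigenvalue, and the K-means step is a plain Lloyd iteration with a deterministic farthest-point initialization. Neither choice is claimed to be the specific procedure analysed in the cited papers, and the function names are hypothetical.

```python
import numpy as np

def kmeans(X, K, n_iter=100):
    """Plain Lloyd iteration with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, K):
        # Squared distance of every row to its nearest chosen center.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def spectral_clustering(M, K):
    """Steps 1-4: eigen-decompose M, keep K eigenvectors, cluster the rows."""
    vals, vecs = np.linalg.eigh(M)                 # M is assumed symmetric
    top = np.argsort(-np.abs(vals))[:K]            # largest |eigenvalue| first
    return kmeans(vecs[:, top], K)

# Usage on a noiseless two-block matrix (hypothetical values 0.8 and 0.1).
Z = np.repeat(np.eye(2), 3, axis=0)                # 6 x 2 membership matrix
M_example = Z @ np.array([[0.8, 0.1], [0.1, 0.8]]) @ Z.T
labels = spectral_clustering(M_example, K=2)
```

On the noiseless example the rows of the eigenvector matrix take only two distinct values, so the two blocks are separated exactly.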

In addition to A, three graph Laplacian matrices discussed by von Luxburg (2007) [33] have been considered extensively, as well as some others we shall mention briefly below, and the matrix which we shall show has optimal asymptotic properties and discuss in greater detail. The matrices popularly considered are:

• $L = D - A$: the graph Laplacian.

• $L_{rw} = D^{-1} A$: the random walk Laplacian.

• $L_{sym} = D^{-1/2} A D^{-1/2}$: the symmetric Laplacian.

Here $D = \mathrm{diag}(A\mathbf{1})$, the diagonal matrix whose diagonal is the vector of row sums of A. Von Luxburg considers optimization problems which are relaxed versions of combinatorial problems which implicitly define clusters as sets of nodes with more internal than external edges. L and $L_{sym}$ appear in two of these relaxations.
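A short sketch of how these three matrices can be formed from an adjacency matrix is given below. The forms $D^{-1}A$ and $D^{-1/2} A D^{-1/2}$ for the random walk and symmetric Laplacians are assumptions about the normalization intended here, and the function name `laplacians` is hypothetical.

```python
import numpy as np

def laplacians(A):
    """Return (L, L_rw, L_sym) for a symmetric adjacency matrix A."""
    deg = A.sum(axis=1)                        # row sums, i.e. A 1
    if np.any(deg == 0):
        raise ValueError("isolated vertices make D singular")
    L = np.diag(deg) - A                       # graph Laplacian D - A
    L_rw = A / deg[:, None]                    # D^{-1} A (rows sum to 1)
    L_sym = A / np.sqrt(np.outer(deg, deg))    # D^{-1/2} A D^{-1/2}
    return L, L_rw, L_sym

# Usage on a path graph with three vertices.
A_path = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])
L, L_rw, L_sym = laplacians(A_path)
```

With this normalization $L_{rw}$ and $L_{sym}$ are similar matrices, so they share eigenvalues, and the informative eigenvectors are those of the largest eigenvalues.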

The form of step 2 differs for L and $L_{sym}$, with the K vectors of the L problem corresponding to the top K eigenvalues of the generalized eigenvalue problem $Lv = \lambda D v$, while the n K-dimensional vectors of the $L_{sym}$ problem are obtained by normalizing the rows of the matrix of K eigenvectors corresponding to the top K eigenvalues of $L_{sym}$. Their relation to the K block model is through asymptotics.

Why is spectral clustering expected to work? Given A generated by a K-block model, let $c \leftrightarrow (n_1, \dots, n_K)$, where $n_a$ is the number of vertices assigned to type a. Then we can write

$$E(A \mid c) = P Q P^T$$

where P is a permutation matrix and $Q_{n \times n}$ has successive blocks of $n_1$ rows, $n_2$ rows and so on, with all the rows within each block identical. Thus $\mathrm{rank}(E(A \mid c)) = K$. The same is true of the asymptotic limit of L given c.

If asymptotics as $n \to \infty$ justify concentration of A or L around their expectations, then we expect all eigenvalues other than the largest K in absolute value to be small. It follows that the n rows of the K eigenvectors associated with the top K eigenvalues should be resolvable into K clusters in $\mathbb{R}^K$, with cluster members identified with rows of $A_{n \times n}$; see [29], [32] for proofs.
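The low-rank structure can be illustrated numerically. In the sketch below the label vector and the connection probabilities are hypothetical, and the zero diagonal of the expected adjacency matrix is ignored for simplicity, which is an assumption of this illustration.

```python
import numpy as np

# Hypothetical labels (already sorted by block) and connection probabilities.
labels = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
P_conn = np.array([[0.50, 0.10, 0.05],
                   [0.10, 0.40, 0.10],
                   [0.05, 0.10, 0.60]])

Z = np.eye(3)[labels]                 # n x K membership matrix
EA = Z @ P_conn @ Z.T                 # block-constant expectation (diagonal kept)
rank = np.linalg.matrix_rank(EA)      # equals K when P_conn is nonsingular
rows_equal_within_block = np.allclose(EA[0], EA[1]) and np.allclose(EA[3], EA[4])
```

Because the rows take only K distinct values, any basis of the K-dimensional row space yields an embedding in which vertices of the same block coincide.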


2.3 Asymptotics 

Now we can consider several asymptotic regimes as $n \to \infty$. Let $\lambda_n = n \rho_n$ be the average degree of the graph.

(I) The dense regime: $\lambda_n = \Omega(n)$.

(II) The semi-dense regime: $\lambda_n / \log(n) \to \infty$.

(III) The semi-sparse regime: not semi-dense but $\lambda_n \to \infty$.

(IV) The sparse regime: $\lambda_n = O(1)$.

Here are some results in the different regimes. We define a method of vertex assignment to communities as a random map $\delta : \{1, \dots, n\} \to \{1, \dots, K\}$, where randomness comes through the dependence of $\delta$ on A as a function. Thus spectral clustering using the various matrices which depend on A is such a $\delta$.

Definition 2. $\delta$ is said to be strongly consistent if

$$P(i \text{ belongs to } a \text{ and } \delta(i) = a \text{ for all } i, a) \to 1 \text{ as } n \to \infty.$$


Note that the blocks are only determined up to permutation.
Bickel and Chen (2009) [8] show that in the (semi-)dense regime a method called profile likelihood is strongly consistent under minimal identifiability conditions, and later this result was extended [7] to fitting by maximum likelihood or variational likelihood. In fact, in the (semi-)dense regime, the block model likelihood asymptotically agrees with the joint likelihood of A and vertex block identities, so that efficient estimation of all parameters is possible. It is easy to see that the result cannot hold in the (semi-)sparse regime, since isolated points then exist with probability 1.

Unfortunately, all of these methods are computationally intensive. Although spectral clustering is not strongly consistent, a slight variant, reassigning vertices in any cluster a which are maximally connected to another cluster b rather than a, is strongly consistent.

Definition 3. $\delta$ is said to be weakly consistent if and only if

$$W \equiv n^{-1} \sum_{i=1}^n P(i \in a, \delta(i) \ne a) = o(1).$$
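Since blocks are only determined up to permutation, an empirical version of the misclassification rate W has to be minimized over relabelings. A minimal sketch follows; the function name is hypothetical, and the brute-force search over all K! permutations is only practical for small K.

```python
from itertools import permutations
import numpy as np

def misclassification_rate(true_labels, est_labels, K):
    """Fraction of misassigned vertices, minimized over label permutations."""
    true_labels = np.asarray(true_labels, dtype=int)
    est_labels = np.asarray(est_labels, dtype=int)
    best = 1.0
    for perm in permutations(range(K)):
        relabeled = np.array(perm)[est_labels]     # apply the relabeling
        best = min(best, float(np.mean(relabeled != true_labels)))
    return best
```

For example, an estimate that swaps the two labels of a two-block graph has rate 0, while one that additionally misplaces a single vertex out of four has rate 0.25.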

Spectral clustering applied to A [32] or the Laplacians ([29], in the manner we have described) has been shown to be weakly consistent in the semi-dense to dense regimes. Even weak consistency fails for parts of the sparse regime [1]. The best that can be hoped for is $W < 1/2$. A sharp problem has been posed and eventually resolved in a series of papers, Decelle et al. [14], Mossel et al. [27]. These writers considered the case $K = 2$, $\pi_1 = \pi_2$, $B_{11} = B_{22}$. First, Decelle et al. [14] argued on physical grounds that if $F \equiv (B_{11} - B_{12})^2 / (2(B_{11} + B_{12})) < 1$, then $W \ge 1/2$ for any method and parameters are unestimable from the data even if they satisfy the minimal identifiability conditions given below. On the other hand, Mossel et al. [27] and independently Massoulie et al. [24] devised admittedly slow methods such that if $F > 1$ then $W < 1/2$ and parameters can be estimated consistently.

We now present a fast spectral clustering method, given in greater detail in [6], which yields weak consistency from the semi-sparse regime on and also has the properties of the Mossel et al. and Massoulie methods. In fact, it reaches the phase transition threshold for all K, not just $K = 2$, but still restricted to $\pi_j = 1/K$ for all j and $B_{aa} + \sum\{B_{ab} : b \ne a\}$ independent of a for all a.

We note that Zhao et al. (2015) [17] exhibit a two-stage algorithm which exhibits the same behavior, but its properties in the sparse case are unknown. The algorithm given in the next section involves spectral clustering of a new matrix, that of all geodesic distances between i and j.
3 Algorithm

As usual let $G_n$, an undirected graph on n vertices, be the data. Denote the vertex set by $V(G_n) = \{v_1, \dots, v_n\}$ and the edge set by $E(G_n) = \{e_1, \dots, e_m\}$, with cardinalities $|V(G_n)| = n$ and $|E(G_n)| = m$.

As usual a path between vertices u and v is a set of edges $\{(u, v_1), (v_1, v_2), \dots, (v_{\ell-1}, v)\}$ and the length of such a path is $\ell$.

The algorithm we propose depends on the graph distance or geodesic distance 
between vertices in a graph. 

Definition 4. The graph or geodesic distance between two vertices i and j of graph G is given by the length of the shortest path between the vertices i and j, if they are connected. Otherwise, the distance is infinite.

So, for any two vertices $u, v \in V(G)$, the graph distance $d_g$ is defined by

$$d_g(u, v) = \begin{cases} \min\{\ell \mid \exists \text{ path of length } \ell \text{ between } u \text{ and } v\}, & \text{if } u \text{ and } v \text{ are connected} \\ \infty, & \text{if } u \text{ and } v \text{ are not connected} \end{cases}$$

For implementation, we can replace $\infty$ by $n + 1$ when u and v are not connected, since any path with loops cannot be a geodesic. The main steps of the algorithm are as follows.

1. Find the graph distance matrix $D = [d_g(v_i, v_j)]_{i,j=1}^n$ for a given network, but with distance upper bounded by $k \log n$. Assign non-connected vertices an arbitrary high value.

2. Perform hierarchical clustering to identify the giant component $G^c$ of graph G. Let $n_c = |V(G^c)|$.

3. Normalize the graph distance matrix on $G^c$ by

$$\bar D = -\left(I - \frac{1}{n_c} \mathbf{1}\mathbf{1}^T\right) D^2 \left(I - \frac{1}{n_c} \mathbf{1}\mathbf{1}^T\right).$$

4. Perform eigenvalue decomposition on $\bar D$.

5. Consider the top K eigenvectors of the normalized distance matrix and let W be the $n \times K$ matrix formed by arranging the K eigenvectors as columns. Perform K-means clustering on the rows of W, that is, find an $n \times K$ matrix C which has K distinct rows and minimizes $\|C - W\|_F$.

6. (Alternative to 5.) Perform Gaussian mixture model based clustering on the rows of W, when there is an indication of highly varying average degree between the communities.

7. Let $\hat c : V \to [K]$ be the block assignment function according to the clustering of the rows of W performed in either Step 5 or 6.

Here are some important observations about the implementation of the algorithm:

(a) There are standard algorithms for finding graph distances in the algorithmic graph theory literature, where the problem is known as the all pairs shortest path problem. The two most popular algorithms are Floyd-Warshall [16] [34] and Johnson's algorithm [21].

(b) Step 3 of the algorithm is nothing but the classical multi-dimensional scaling (MDS) of the graph distance matrix.

(c) In Step 5 of the algorithm, K-means clustering is appropriate if the expected degrees of the blocks are equal. However, if the expected degrees of the blocks are different, this leads to multi-scale behavior in the eigenvectors of the normalized distance matrix and bad behavior in practice. So, we perform Gaussian Mixture Model (GMM) based clustering instead of K-means to take that into account.
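The first steps of the algorithm can be sketched as below, producing the matrix W whose rows are then clustered. Several choices here are assumptions of this illustration rather than the authors' implementation: distances are computed by breadth-first search (adequate for unweighted graphs) instead of Floyd-Warshall or Johnson's algorithm, the giant component is found from reachability instead of hierarchical clustering, the normalization is the double centering of entrywise squared distances used in classical MDS, and "top" eigenvectors are taken by largest absolute eigenvalue. The names `bfs_from`, `geodesic_embedding` and `cap` are hypothetical.

```python
from collections import deque
import numpy as np

def bfs_from(neighbors, s):
    """Shortest path lengths from vertex s; unreachable vertices get inf."""
    dist = np.full(len(neighbors), np.inf)
    dist[s] = 0.0
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if dist[v] == np.inf:
                dist[v] = dist[u] + 1.0
                queue.append(v)
    return dist

def geodesic_embedding(adj, K, cap=None):
    """Return (giant, Dg, Dbar, W) for a symmetric 0/1 adjacency matrix."""
    n = adj.shape[0]
    neighbors = [np.flatnonzero(adj[i]) for i in range(n)]
    D = np.vstack([bfs_from(neighbors, s) for s in range(n)])
    # Giant component: vertices reachable from a vertex in the largest component.
    sizes = np.isfinite(D).sum(axis=1)
    giant = np.flatnonzero(np.isfinite(D[int(np.argmax(sizes))]))
    Dg = D[np.ix_(giant, giant)]
    if cap is not None:                      # e.g. cap = k * log(n) for a chosen k
        Dg = np.minimum(Dg, cap)
    nc = len(giant)
    J = np.eye(nc) - np.ones((nc, nc)) / nc  # centering matrix I - (1/nc) 1 1^T
    Dbar = -J @ (Dg ** 2) @ J                # double-centered squared distances
    vals, vecs = np.linalg.eigh(Dbar)
    top = np.argsort(-np.abs(vals))[:K]
    W = vecs[:, top]                         # rows of W are clustered in Step 5 or 6
    return giant, Dg, Dbar, W
```

The rows of `W` can be passed to any K-means or Gaussian mixture routine; vertices outside the giant component are left unassigned by this sketch.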

General theoretical results on the algorithm will be given in [6]. In this paper, we first restrict to the sparse regime. We do so because the arguments in the sparse regime are essentially different from the others. Curiously, it is in the sparse and part of the semi-sparse regime only that the matrix $\bar D$ concentrates to an $n \times n$ matrix with K distinct types of row vectors, as for the other methods of spectral clustering. It does not concentrate in the dense regime, while the opposite is true of A and L: they do not concentrate outside the semi-dense regime. That the geodesic matrix does not concentrate in the dense regime can easily be seen, since asymptotically all geodesic paths are of constant length. But the distributions of path lengths differ from block to block, ensuring that the spectral clustering works. We do not pursue this further here.


4 Theoretical Results 

Throughout this section we take $\rho_n = 1/n$ and specialize to the case

$$B = (p - q) I_{K \times K} + q \mathbf{1}\mathbf{1}^T$$

where I is the identity and $\mathbf{1} = (1, \dots, 1)^T$. That is, all K blocks have the same probability p of connecting two block members and probability q of connecting members of two different blocks, and $p > q$. We also assume that $\pi_a = 1/K$, $a = 1, \dots, K$: all blocks are asymptotically of the same size. We restrict ourselves to this model here because it is the one treated by Mossel, Neeman and Sly (2013) [27], and its already subtle technical details are not obscured. Here is the result we prove.

Theorem 1. For the given model, if

$$(p - q)^2 > K(p + (K - 1)q), \qquad (3)$$

and our algorithm is applied, $\hat c$ results and c is the true assignment function, then

$$P\left[\frac{1}{n} \sum_{i=1}^n 1(\hat c(v_i) \ne c(v_i)) < \frac{1}{2}\right] \to 1. \qquad (4)$$

Notes:

1. (3) marks the phase transition conjectured by [14].

2. A close reading of our proof shows that as $(p - q)^2 / (K(p + (K - 1)q)) \to \infty$, $\frac{1}{n} \sum_{i=1}^n 1(\hat c(v_i) \ne c(v_i)) \xrightarrow{P} 0$.
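Condition (3) can be checked directly, and it is equivalent to $\lambda_2^2 > \lambda_1$ for the eigenvalues of $M = B/K$ under this model, since $\lambda_1 = (p + (K-1)q)/K$ and $\lambda_2 = (p - q)/K$. The sketch below verifies the equivalence numerically; the function names and the numerical values of p, q and K are hypothetical.

```python
import numpy as np

def above_threshold(p, q, K):
    """Condition (3): (p - q)^2 > K (p + (K - 1) q)."""
    return (p - q) ** 2 > K * (p + (K - 1) * q)

def top_two_eigenvalues(p, q, K):
    """Two largest eigenvalues of M = B / K with B = (p - q) I + q 1 1^T."""
    B = (p - q) * np.eye(K) + q * np.ones((K, K))
    lam = np.sort(np.linalg.eigvalsh(B / K))[::-1]
    return lam[0], lam[1]

# Usage with hypothetical kernel values: p = 9, q = 1, K = 2.
lam1, lam2 = top_two_eigenvalues(9.0, 1.0, 2)   # (9 + 1) / 2 and (9 - 1) / 2
detectable = above_threshold(9.0, 1.0, 2)       # 64 > 20
```

Here $\lambda_1 = 5$ and $\lambda_2 = 4$, so $\lambda_2^2 = 16 > 5$, in agreement with condition (3).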

We conjecture that our conclusion in fact holds under the following conditions.

(A1) We consider $\lambda_1 > 1$, $\lambda_1 > \max_{j \ge 2} \lambda_j$, $1 \le j \le K$, and $\lambda_K > 0$. For M, there exists a k such that $(M^k)_{ab} > 0$ for all $a, b = 1, \dots, K$. Also, $\pi_j > 0$ for $j = 1, \dots, K$.

(A2) Each vertex has the same asymptotic average degree $\alpha > 1$, that is,

$$\alpha = \sum_{k=1}^K \pi_k B_{ak} = \sum_{k=1}^K M_{ak}, \quad \text{for all } a \in \{1, \dots, K\}.$$

(A3) We assume that

$$\lambda_K^2 > \lambda_1$$

or alternatively, there exists real positive t such that

$$\sum_{k=1}^K \varphi_k(a) \lambda_k^t \varphi_k(b) < n, \quad \text{for all } a, b = 1, \dots, K.$$

Note that (A1)-(A3) all hold for the case we consider. In fact, under our model, $\lambda_1 = (p + (K-1)q)/K$ and $\lambda_2 = \dots = \lambda_K = (p - q)/K$, with (A3) being the condition of the Theorem.

Our argument will be stated in a form that is generalizable, and we will indicate revisions in intermediate statements as needed, pointing in particular to a lemma whose conclusion only holds if an implication of (A3) we conjecture is valid.

The theoretical analysis of the algorithm has two main parts:

I. Finding the limiting distribution of the graph distance between two typical vertices of type a and type b (where $a, b = 1, \dots, K$). This part of the analysis is highly dependent on results from multi-type branching processes and their relation with stochastic block models. The proof techniques and results are borrowed from [9], [5] and [4].

II. Finding the behavior of the top K eigenvectors of the graph distance matrix D using the limiting distribution of the typical graph distances. This part of the analysis is highly dependent on perturbation theory of linear operators. The proof techniques and results are borrowed from [22], [12] and [32].

We will state two theorems corresponding to I and II above.

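Theorem 2. Under our model, the graph distance $d_G(u, v)$ between two uniformly chosen vertices of type a and b respectively, conditioned on being connected, satisfies the following asymptotic relation:

(i) If $a = b$, for any $\varepsilon > 0$, as $n \to \infty$,

$$P[(1 - \varepsilon)\tau_1 \le d_G(u, v) \le (1 + \varepsilon)\tau_1] = 1 - o(1) \qquad (5)$$

where $\tau_1$ is the minimum real positive t which satisfies the relation below,

$$\frac{1}{K}\left[\left(\frac{p + (K-1)q}{K}\right)^t + (K - 1)\left(\frac{p - q}{K}\right)^t\right] = n. \qquad (6)$$

(ii) If $a \ne b$, for any $\varepsilon > 0$, as $n \to \infty$,

$$P[(1 - \varepsilon)\tau_2 \le d_G(u, v) \le (1 + \varepsilon)\tau_2] = 1 - o(1) \qquad (7)$$

where $\tau_2$ is the minimum real positive t which satisfies the relation below,

$$\frac{1}{K}\left[\left(\frac{p + (K-1)q}{K}\right)^t - \left(\frac{p - q}{K}\right)^t\right] = n. \qquad (8)$$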

In Theorem 2 we have a point-wise result. To use matrix perturbation theory for 
part II we need the following. 
Theorem 3. Let D be the restriction of the geodesic matrix to vertices in the big component of $G_n$. Then, under our model,

$$P\left[\left\|\frac{D}{\log n} - \mathcal{D}\right\|_F \le o(n)\right] = 1 - o(1)$$

where $\mathcal{D}_{ij} = \sigma_1 = \tau_1 / \log n$ if $v_i$ and $v_j$ have the same type and $\mathcal{D}_{ij} = \sigma_2 = \tau_2 / \log n$ otherwise, where $\tau_1$ and $\tau_2$ are the solutions t of Eq. (6) and (8) respectively.


To generalize Theorem 1, we need appropriate generalizations of Theorems 2 and 3. Heuristically, it may be argued that the generalizations $(\tau_{ab})$, $a, b = 1, \dots, K$, should satisfy the equations

$$\sum_{k=1}^K \varphi_k(a) \lambda_k^{\tau_{ab}} \varphi_k(b) = (S^{\tau_{ab}})_{ab} = n, \quad \text{for } a \le b \in [K]. \qquad (9)$$

Our conjecture is that (A1)-(A3) imply that the equations have asymptotic solutions and that the statements of Theorems 2 and 3 hold with obvious modifications.

Note that in Theorem 2, since $\lambda_j = \lambda_2$, $2 \le j \le K$, there are effectively only two equations, and modifications are also needed for other degeneracies in the parameters. We next turn to a branching process result in [10] which we will use heavily.


4.1 A Key Branching Process Result 

As others have done, we link the network formed by the SBM with the tree network generated by a multi-type Galton-Watson branching process. In our case, the multi-type branching process (MTBP) has type space $S = \{1, \dots, K\}$, where a particle of type $a \in S$ is replaced in the next generation by a set of particles distributed as a Poisson process on S with intensity $(B_{ab}\pi_b)_{b=1}^K = (M_{ab})_{b=1}^K$. Recall the definitions of B, M and S from Section 2.1. We denote this branching process, started with a single particle of type a, by $\mathfrak{B}_{B,\pi}(a)$. We write $\mathfrak{B}_{B,\pi}$ for the same process with the type of the initial particle random, distributed according to $\pi$. According to Theorem 8.1 of Chapter 1 of [25], the branching process has a positive survival probability if $\lambda_1 > 1$, where $\lambda_1$ is the Perron-Frobenius eigenvalue of M, a positive regular matrix. Recall that for our special M, $\lambda_1 = (p + (K-1)q)/K$.

Definition 5. (a) Define $\rho(B, \pi; a)$ as the probability that the branching process $\mathfrak{B}_{B,\pi}(a)$ survives for eternity.

(b) Define

$$\rho \equiv \rho(B, \pi) \equiv \sum_{a=1}^K \rho(B, \pi; a)\pi_a \qquad (10)$$

as the survival probability of the branching process $\mathfrak{B}_{B,\pi}$ given that its initial distribution is $\pi$.

We denote by $Z_t = (Z_t(a))_{a=1}^K$ the population of particles of K different types, with $Z_t(a)$ denoting particles of type a, at generation t for the Poisson multi-type branching process $\mathfrak{B}_{B,\pi}$, with B and $\pi$ as defined in Section 4. From Theorem 24 of [10], we get the following.

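Theorem 4 ([10]). Let $\beta > 0$ and $Z_0 = x \in \mathbb{N}^K$ be fixed. There exists $C = C(x, \beta) > 0$ such that with probability at least $1 - n^{-\beta}$, for all $k \in [K]$, all $s, t \ge 0$ with $0 \le s < t$,

$$|\langle \phi_k, Z_s \rangle - \lambda_k^{s-t} \langle \phi_k, Z_t \rangle| \le C(t + 1)^2 \lambda_k^{s/2} (\log n)^{3/2}. \qquad (11)$$

Remark: The above stated theorem is a special case of the general theorem stated in [10]. The general theorem is required for generalizing Theorem 1. The general version of the theorem is

Theorem 5 ([10]). Let $\beta > 0$ and $Z_0 = x \in \mathbb{N}^K$ be fixed. There exists $C = C(x, \beta) > 0$ such that with probability at least $1 - n^{-\beta}$, for all $k \in [K_0]$ (where $K_0$ is the largest integer such that $\lambda_k^2 > \lambda_1$ for all $k \le K_0$), all $s, t \ge 0$ with $0 \le s < t$,

$$|\langle \phi_k, Z_s \rangle - \lambda_k^{s-t} \langle \phi_k, Z_t \rangle| \le C(t + 1)^2 \lambda_k^{s/2} (\log n)^{3/2} \qquad (12)$$

and for all $k \in [K] \setminus [K_0]$, for all $t \ge 0$,

$$|\langle \phi_k, Z_t \rangle| \le C(t + 1)^2 \lambda_1^{t/2} (\log n)^{3/2}. \qquad (13)$$

Finally, for all $k \in [K] \setminus [K_0]$, all $t \ge 0$, $E|\langle \phi_k, Z_t \rangle|^2 \le C(t + 1)^3 \lambda_1^t$.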


4.2 The Neighborhood Exploration Process 

The neighborhood exploration process of a vertex v in a graph G generated from an SBM gives us a handle on the link between local structures of a graph from the SBM and the multi-type branching process. Recall the definitions of SBM parameters from Section 2.1 and the definitions of the Poisson multi-type branching process from Section 4.1. We assume all vertices of the graph $G_n$ generated from a stochastic block model have been assigned a community or type, say $c_i$ for vertex $v_i \in V(G_n)$.

The neighborhood exploration process, $(G, v)_L$, of a vertex v in graph $G_n$ generates a spanning tree of the induced subgraph of $G_n$ consisting of vertices at distance at most L from v. The spanning tree is formed from the exploration process which starts from a vertex v as the root in the random graph $G_n$ generated from the stochastic block model. The set of vertices of type a of the random graph $G_n$ that are neighbors of v and have not been previously explored is called $\Gamma_{1,a}(v)$, with $N_{1,a}(v) = |\Gamma_{1,a}(v)|$ for $a = 1, \dots, K$ and $N_1(v) = (N_{1,1}(v), \dots, N_{1,K}(v))$. So, $\Gamma_1(v) = \{\Gamma_{1,1}(v), \dots, \Gamma_{1,K}(v)\}$ are the children of the root v at step $\ell = 1$ in the spanning tree of the neighborhood exploration process. The neighborhood exploration process is repeated at the second step by looking at the neighbors of type a of the vertices in $\Gamma_1(v)$ that have not been previously explored; this set is called $\Gamma_{2,a}(v)$, with $N_{2,a}(v) = |\Gamma_{2,a}(v)|$ for $a = 1, \dots, K$. Similarly, $\Gamma_2(v) = \{\Gamma_{2,1}(v), \dots, \Gamma_{2,K}(v)\}$ are the children of the vertices $\Gamma_1(v)$ at step $\ell = 2$ in the spanning tree of the neighborhood exploration process. The exploration process is continued until step $\ell = L$. Note that the process stops when all the vertices in $G_n$ have been explored. So, if $G_n$ is connected, then $L \le$ the diameter of the graph $G_n$.

Since we either consider $G_n$ connected or only the giant component of $G_n$, the neighborhood exploration process will end in a finite number of steps, but the number of steps may depend on n and is equal to the diameter, L, of the connected component of the graph containing the root v. It follows from Theorem 14.11 of [9] that

$$\frac{L}{\log_{\lambda_1}(n)} \xrightarrow{P} 1. \qquad (14)$$

Now, we find a coupling relation between the neighborhood exploration process of a vertex of type a in the stochastic block model and a multi-type Galton-Watson process, $\mathfrak{B}_{B,\pi}(a)$, starting from a vertex of type a. The Lemma is based on Proposition 31 of [10].

Lemma 1. Let $w(n)$ be a sequence such that $w(n) \to \infty$ and $w(n)/n \to 0$. Let $(T, v)$ be the random rooted tree associated with the Poisson multi-type Galton-Watson branching process defined in Section 4.1 started from $Z_0 = \delta_{c_v}$, and $(G, v)$ be the spanning tree associated with the neighborhood exploration process of the random SBM graph $G_n$ starting from v. For $\ell \le \tau$, where $\tau$ is the number of steps required to explore $w(n)$ vertices in $(G, v)$, the total variation distance, $d_{TV}$, between the law of $(G, v)_\ell$ and $(T, v)_\ell$ at step $\ell$ goes to zero as $O(n^{-1/2} \vee w(n)/n) = o(1)$.

Proof. Let us start the neighborhood exploration process with vertex v of a graph generated from an SBM model with parameters $(P, \pi) = (B/n, \pi)$. Correspondingly, the multi-type branching process starts from a single particle of type $c_v$, where $c_v$ is the type or class of vertex v in the SBM.

Let t be such that $0 \le t < \tau$, where $\tau$ is defined in the Lemma statement. Now, for such a $t \ge 0$, let $(x_{t+1}(1), \dots, x_{t+1}(K))$ be the leaves of $(T, v)$ at time t starting from a vertex $v_t$ generated by step t of class $c_{v_t} = a$. Let $(y_{t+1}(1), \dots, y_{t+1}(K))$ be the vertices exposed at step t of the exploration process starting from a vertex of class a, where $a \in [K]$. Now, if $c_{v_t}$ is of type a, then $x_{t+1}(b)$ follows $\mathrm{Poi}(\pi_b B_{ab})$ and $y_{t+1}(b)$ follows $\mathrm{Bin}(n_t(b), B_{ab}/n)$ for $b = 1, \dots, K$, where $n_t(b)$ is the number of unused vertices of type b remaining at time t for $b = 1, \dots, K$. Also, the counts for different b are independent. Note that $n_b \ge n_t(b) \ge n_b - w(n)$ for $b = 1, \dots, K$. So, since we have $|n_b/n - \pi_b| = O(n^{-1/2})$ for $b = 1, \dots, K$, we get that

$$|n_t(b)/n - \pi_b| \le O(n^{-1/2} \vee w(n)/n) \quad \text{for } b = 1, \dots, K.$$

Now, we know that

$$d_{TV}(\mathrm{Bin}(m', \lambda/m), \mathrm{Poi}(m'\lambda/m)) \le \frac{\lambda}{m}, \qquad d_{TV}(\mathrm{Poi}(\lambda), \mathrm{Poi}(\lambda')) \le |\lambda - \lambda'|.$$

So, now, we have

$$d_{TV}(P_{t+1}, Q_{t+1}) \le O(n^{-1/2} \vee w(n)/n) = o(1)$$

where $P_{t+1}$ is the distribution of $y_{t+1}$ under the neighborhood exploration process and $Q_{t+1}$ is the distribution of $x_{t+1}$ under the branching process, and hence Lemma 1 follows.
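The two total variation bounds used in the proof can be checked numerically. The sketch below truncates the sums at a finite `kmax`, which is an approximation that is harmless for the small means used here; the function names and the numerical values are hypothetical.

```python
import math

def binom_pmf(k, m, p):
    """P(Bin(m, p) = k), zero outside the support."""
    if k > m:
        return 0.0
    return math.comb(m, k) * p ** k * (1 - p) ** (m - k)

def poisson_pmf(k, lam):
    """P(Poi(lam) = k), computed on the log scale for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tv_distance(pmf_a, pmf_b, kmax=200):
    """Half the L1 distance between two pmfs on {0, ..., kmax}."""
    return 0.5 * sum(abs(pmf_a(k) - pmf_b(k)) for k in range(kmax + 1))

# Bin(m', lam/m) versus Poi(m' lam / m) with m' = 80, m = 100, lam = 2.
tv_bin_poi = tv_distance(lambda k: binom_pmf(k, 80, 0.02),
                         lambda k: poisson_pmf(k, 1.6))
# Poi(lam) versus Poi(lam') with lam = 2 and lam' = 2.3.
tv_poi_poi = tv_distance(lambda k: poisson_pmf(k, 2.0),
                         lambda k: poisson_pmf(k, 2.3))
```

In this example the first distance is bounded by $\lambda/m = 0.02$ and the second by $|\lambda - \lambda'| = 0.3$.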

Now, we restrict ourselves to the giant component of $G_n$. The size of the giant component of $G_n$, $C_1(G_n)$, of a random graph generated from SBM$(B, \pi)$ is related to the multi-type branching process through its survival probability as given in Definition 5. According to Theorem 3.1 of [9], we have

$$\frac{1}{n} C_1(G_n) \xrightarrow{P} \rho(B, \pi). \qquad (15)$$

Under this additional condition of restricting to the giant component, the branching process can be coupled with another branching process with a different kernel. The kernel of that branching process is given in the following lemma.

Lemma 2. If v is in the giant component of $G_n$, the new branching process has kernel $\left(B_{ab}\left(2\rho(B, \pi)/K - \rho^2(B, \pi)/K^2\right)\right)_{a,b=1}^K$.

Proof. The proof is given in Section 10 of [9].

Since we will be restricting ourselves to the giant component of $G_n$, we shall be using the matrix $B' \equiv \left(B_{ab}\left(2\rho(B, \pi)/K - \rho^2(B, \pi)/K^2\right)\right)_{a,b=1}^K$ as the connectivity matrix instead of B. We abuse notation by referring to the matrix $B'$ as B too.

We proceed to prove the limiting behavior of the typical distance between vertices v and w of $G_n$, where $v, w \in V(G_n)$. We first find a lower bound for the distance between two vertices. We shall separately give upper and lower bounds for the distance between two vertices of the same type and of different types.

Lemma 3. Under our model, for vertices $v, w \in V(G)$, if

(a) type of v = type of w = a (say), then

$$|\{\{v, w\} : d_G(v, w) \le (1 - \varepsilon)\tau_1\}| \le O(n^{2-\varepsilon}) \text{ with high probability}$$

where $\tau_1$ is the minimum real positive t which satisfies Eq. (6),

(b) type of v = a $\ne$ b = type of w (say), then

$$|\{\{v, w\} : d_G(v, w) \le (1 - \varepsilon)\tau_2\}| \le O(n^{2-\varepsilon}) \text{ with high probability}$$

where $\tau_2$ is the minimum real positive t which satisfies Eq. (8).

Proof. Let $\Gamma_d(v) \equiv \Gamma_d(v, G_n)$ denote the d-distance set of v in $G_n$, i.e., the set of vertices of $G_n$ at graph distance exactly d from v, and let $\Gamma_{\le d}(v) \equiv \Gamma_{\le d}(v, G_n)$ denote the d-neighborhood $\cup_{d' \le d} \Gamma_{d'}(v)$ of v. Let $\Gamma_{d,a}(v) \equiv \Gamma_{d,a}(v, G_n)$ denote the set of vertices of type a at d-distance in $G_n$ and let $\Gamma_{\le d,a}(v) \equiv \Gamma_{\le d,a}(v, G_n)$ denote the d-neighborhood $\cup_{d' \le d} \Gamma_{d',a}(v)$ of v consisting of vertices of type a. Let $N_d^a$ be the number of particles at generation d of the branching process $\mathfrak{B}_B(\delta_a)$ and $N_{d,c}^a$ be the number of particles of type c at generation d of the branching process $\mathfrak{B}_B(\delta_a)$. So, $N_d^a = \sum_{c=1}^K N_{d,c}^a$ and $Z_t(k) = \sum_{d=0}^t N_{d,k}^a$.

Lemma 1 involved first showing that, for n large enough, the neighborhood exploration process starting at a given vertex v of $G_n$ with type a could be coupled with the branching process $\mathfrak{B}_{B'}(\delta_a)$, where $B'$ is defined by Lemma 2. As noted, we identify $B'$ with B.

The neighborhood exploration process and multi-type branching process can be coupled so that for every d, $|\Gamma_d(v)|$ is at most the number $N_d + O(n^{-1/2} \vee w(n)/n)$, where $N_d$ is the number of particles in generation d of $\mathfrak{B}_{B'}(\delta_a)$ and in d generations at most $w(n)$ vertices of $G_n$ have been explored.

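From Theorem 4, we get that with high probability

$$\left|\frac{\langle \phi_k, Z_t \rangle}{\lambda_k^t} - \langle \phi_k, Z_0 \rangle\right| \le C(t + 1)^2 (\log n)^{3/2}.$$

For any $x \in \mathbb{R}^K$ we get the unique representation $x = \sum_{k=1}^K \langle x, \phi_k \rangle \phi_k$ for any orthonormal basis $\{\phi_k\}_{k=1}^K$ of $\mathbb{R}^K$. If we take $x = e_b$, where $e_b$ is the unit vector with 1 at the b-th coordinate and 0 elsewhere, $b = 1, \dots, K$, we can get

$$Z_t(b) \le \sum_{k=1}^K \phi_k(b) \lambda_k^t \phi_k(a) \left[Z_0(a) + C(t + 1)^2 (\log n)^{3/2}\right].$$

Now, under our model one representation of the eigenvectors is $\phi_1 = \frac{1}{\sqrt{K}}(1, \dots, 1)$, $\phi_2 = \frac{1}{\sqrt{2}}(-1, 1, 0, \dots, 0)$, $\phi_3 = \frac{1}{\sqrt{6}}(-1, -1, 2, 0, \dots, 0)$, $\dots$, $\phi_K = \frac{1}{\sqrt{K(K-1)}}(-1, \dots, -1, K - 1)$. Now, using the representation of eigenvectors for the branching process starting from a vertex of type a, $a \in [K]$, we get with high probability

$$\sum_{k=1}^K Z_t(k) \le \lambda_1^t \left[Z_0(a) + C(t + 1)^2 (\log n)^{3/2}\right]$$

$$Z_t(a) - Z_t(b) \ge \lambda_2^t \left[-Z_0(a) - C(t + 1)^2 (\log n)^{3/2}\right], \quad b = 1, \dots, K \text{ and } b \ne a.$$

So, we can simplify, for each $a \in [K]$ with $Z_0(a) = 1$, with high probability,

$$Z_t(a) \le \frac{1}{K}\left(\lambda_1^t + (K - 1)\lambda_2^t\right)\left[1 + C(t + 1)^2 (\log n)^{3/2}\right]$$

$$Z_t(b) \le \frac{\lambda_1^t - \lambda_2^t}{K}\left[1 + C(t + 1)^2 (\log n)^{3/2}\right], \quad b \in [K] \text{ and } b \ne a.$$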

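Set $D_1 = (1 - \varepsilon)\tau_1$, where $\tau_1$ is the solution to the equation

$$\frac{1}{K}\left(\lambda_1^t + (K - 1)\lambda_2^t\right) = n$$

and set $D_2 = (1 - \varepsilon)\tau_2$, where $\tau_2$ is the solution to the equation

$$\frac{1}{K}\left(\lambda_1^t - \lambda_2^t\right) = n$$

where $\varepsilon > 0$ is fixed and small. Note that both $\tau_1$ and $\tau_2$ are of the order $O(\log n)$. Thus, with high probability, for v of type a and $w(n) = O(n^{1-\varepsilon})$,

$$|\Gamma_{\le D_1, a}(v)| = \sum_{d=0}^{D_1} N_{d,a}^a \le Z_{D_1}(a) + O\left(D_1 (n^{-1/2} \vee w(n)/n)\right) = O(n^{1-\varepsilon})$$

$$|\Gamma_{\le D_2, b}(v)| = \sum_{d=0}^{D_2} N_{d,b}^a \le Z_{D_2}(b) + O\left(D_2 (n^{-1/2} \vee w(n)/n)\right) = O(n^{1-\varepsilon})$$

So, summing over $v \in C_a$ and $v \in C_b$, where $C_a = \{i \in V(G) \mid c_i = a\}$ and $C_b = \{i \in V(G) \mid c_i = b\}$, we have

$$\sum_{v \in C_a} |\Gamma_{\le D_1, a}(v)| = |\{\{v, w\} : d_G(v, w) \le (1 - \varepsilon)\tau_1,\ v, w \in C_a\}|$$

$$\sum_{v \in C_a} |\Gamma_{\le D_2, b}(v)| = |\{\{v, w\} : d_G(v, w) \le (1 - \varepsilon)\tau_2,\ v \in C_a, w \in C_b\}|$$

and so with high probability

$$|\{\{v, w\} : d_G(v, w) \le (1 - \varepsilon)\tau_1,\ v, w \in C_a\}| \le \sum_{v \in V(G_n)} |\Gamma_{\le D_1, a}(v)| = O(n^{2-\varepsilon})$$

$$|\{\{v, w\} : d_G(v, w) \le (1 - \varepsilon)\tau_2,\ v \in C_a, w \in C_b\}| \le \sum_{v \in V(G_n)} |\Gamma_{\le D_2, b}(v)| = O(n^{2-\varepsilon})$$

The above statement is equivalent to

$$P[|\{\{v, w\} : d_G(v, w) \le (1 - \varepsilon)\tau_1,\ v, w \in C_a\}| \le O(n^{2-\varepsilon})] = 1 - o(1)$$

$$P[|\{\{v, w\} : d_G(v, w) \le (1 - \varepsilon)\tau_2,\ v \in C_a, w \in C_b\}| \le O(n^{2-\varepsilon})] = 1 - o(1)$$

for any fixed $\varepsilon > 0$.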

Now, we upper bound the typical distance between two vertices of the SBM graph
$G_n$.

Lemma 4. Under our model, for vertices $v, w \in V(G)$ and conditioned on the event
that the exploration process starts from a vertex in the giant component of $G$, if

(a) type of $v$ = type of $w$ = $a$ (say), then

$$P(d_G(v,w) \le (1+\epsilon)\tau_1) = 1 - \exp(-\Omega(n^{2\eta})),$$

where $\tau_1$ is the minimum real positive $t$ which satisfies Eq. (6),

(b) type of $v$ = $a \ne b$ = type of $w$ (say), then

$$P(d_G(v,w) \le (1+\epsilon)\tau_2) = 1 - \exp(-\Omega(n^{2\eta})),$$

where $\tau_2$ is the minimum real positive $t$ which satisfies Eq. (8).

Proof. We consider the multi-type branching process with probability kernel $P_{ab} = B_{ab}/n$
for all $a, b = 1, \dots, K$, and the corresponding random graph $G_n$ generated from the stochastic
block model has in total $n$ nodes. We condition on the event that the branching process survives.

Note that an upper bound 1 is obvious, since we are bounding a probability, so
it suffices to prove a corresponding lower bound. We may and shall assume that
$B_{ab} > 0$ for some $a, b$.

Again, let $\Gamma_d(v) \equiv \Gamma_d(v, G_n)$ denote the $d$-distance set of $v$ in $G_n$, i.e., the set
of vertices of $G_n$ at graph distance exactly $d$ from $v$, and let $\Gamma_{\le d}(v) \equiv \Gamma_{\le d}(v, G_n)$
denote the $d$-neighborhood $\bigcup_{d' \le d} \Gamma_{d'}(v)$ of $v$. Let $\Gamma_{d,a}(v) \equiv \Gamma_{d,a}(v, G_n)$ denote the
set of vertices of type $a$ at $d$-distance in $G_n$ and let $\Gamma_{\le d,a}(v) \equiv \Gamma_{\le d,a}(v, G_n)$ denote
the $d$-neighborhood $\bigcup_{d' \le d} \Gamma_{d',a}(v)$ of $v$ consisting of vertices of type $a$. Let $N_d$ be
the number of particles at generation $d$ of the branching process $\mathcal{B}_B(\delta_a)$ and $N_{d,c}$ be
the number of particles at generation $d$ of the branching process $\mathcal{B}_B(\delta_a)$ of type $c$. So,



By Lemma 1, for $\omega(n) = o(n)$,



for all $d$ s.t. $|\Gamma_{\le d}(v)| < \omega(n)$. This relation between the number of vertices at generation
$d$ of type $c$ of the branching process $\mathcal{B}_B(\delta_a)$, denoted by $N_{d,c}$, and the number
of vertices of type $c$ at distance $d$ from $v$ for the neighborhood exploration process
of $G_n$, denoted by $|\Gamma_{d,c}(v)|$, becomes highly important later on in this proof, where
$c = 1, \dots, K$. Note that the relation only holds when $|\Gamma_{\le d}(v)| < \omega(n)$ for some $\omega(n)$
such that $\omega(n)/n \to 0$ as $n \to \infty$.

From Theorem 4 of the branching process, we get that with high probability 


<C(logn)3/2 


Now following the same line of argument as in the proof of Lemma 3, for each
$a \in [K]$ with $Z_0(a) = 1$, with high probability we get that

$$Z_t(a) \le \frac{1}{K}\left(\lambda_1^t + (K-1)\lambda_2^t\right)\left[1 + C(t+1)^2 (\log n)^{3/2}\right],$$

$$Z_t(b) \le \frac{\lambda_1^t - \lambda_2^t}{K}\left[1 + C(t+1)^2 (\log n)^{3/2}\right], \quad b \in [K] \text{ and } b \ne a.$$

Let $D_1$ be the integer part of $(1 + 2\eta)\tau_1'$, where $\tau_1'$ is the solution to the equation

Af-Ar 




K 




(17) 


Thus conditioned on survival of the branching process 3§B{Sa), a — 
Set $D_2 = (1 + \eta)\tau_2'$, where $\tau_2'$ is the solution to the equation


A[ = 


(18) 


Thus, conditioned on survival of the branching process $\mathcal{B}_B(\delta_a)$, $N_{D_2,b} \ge n^{1/2+\eta}$ for
$b = 1, \dots, K$. Furthermore, $\lim_{d \to \infty} P(N_d \ne 0) = \rho(B, a)$.

Now, we have conditioned on the branching process with kernel $B$ surviving.
The right-hand side tends to $\rho(B, a) = 1$ as $\eta \to 0$. Hence, given any fixed $\gamma > 0$, if
we choose $\eta > 0$ small enough, and for large enough $n$, we have

= 1 : 

= 1 . 

Now, the neighborhood exploration process and branching process can be coupled
so that for every $d$, $|\Gamma_d(v)|$ is at most the number of particles in generation
$d$ of $\mathcal{B}_B(\delta_a)$, from Lemma 1 and Eq. (16). So, we have for $v$ of type $a$, with high
probability,

$$|\Gamma_{\le D_1, a}(v)| \le \sum_{d=0}^{D_1} N_{d,a} = o(n^{2/3}),$$

$$|\Gamma_{\le D_2, b}(v)| \le \sum_{d=0}^{D_2} N_{d,b} = o(n^{2/3}),$$

if $\eta$ is small enough, since $D_1$ is the integer part of $(1 + 2\eta)\tau_1'$ and $D_2$ is the integer
part of $(1 + 2\eta)\tau_2'$, where $\tau_1'$ and $\tau_2'$ are the solutions to Eq. (17) and (18). Note that the
power 2/3 here is arbitrary; we could have any power in the range $(1/2, 1)$. So now
we are in a position to apply Eq. (16), as we have $|\Gamma_{\le D}(v)| < \omega(n)$, with
$\omega(n)/n \to 0$.
Now let $v$ and $w$ be two fixed vertices of $G(n, P)$, of types $a$ and $b$ respectively.
We explore both their neighborhoods at the same time, stopping either when we
reach distance $D$ in both neighborhoods, or we find an edge from one to the other, in
which case $v$ and $w$ are within graph distance $2D + 1$. We consider two independent
branching processes $\mathcal{B}_B(\delta_a)$, $\mathcal{B}'_B(\delta_b)$, with $N_{d,c}$ and $N'_{d,c}$ vertices of type $c$ in generation
$d$ respectively. By the previous argument, with high probability we encounter
$o(n)$ vertices in the exploration, so, by the argument leading to (16), whp either the
explorations meet, or

|Jrf'e(w)| c=l,...,K,Cy^a 

\rj’,c{w)\ > zP(c) - O (n-^ V , c = 1,... ,K,c ^ b 

with the explorations not meeting, where, is the branching process starting from 
Zo = 5a, for a = Using bound on ^ and the independence of the branch¬ 

ing processes, it follows that for a = b, 

Y>(d{v,w) < 2 Z)i + 1 or .,(v)Uro“ ,,(w)| > > 1 -o(l). 

and for a^b, 

P(d{v,w) < 2 D 2 + 1 or Vc : > 1 -o(l). 

Write these probabilities as $P(A_j \cup B_j)$, $j = 1, 2$. We now show that $P(A_j^c \cap B_j) \to 0$,
and since $P(A_j \cup B_j) \to 1$, we will have $P(A_j) \to 1$. We have not examined any
edges from $\Gamma_D(v)$ to $\Gamma_D(w)$, so these edges are present independently with their
original unconditioned probabilities. For any end vertex types $c_1, c_2$, the expected
number of these edges is at least $|\Gamma_{D_1,c_1}(v)|\,|\Gamma_{D_1,c_2}(w)|\,B_{c_1 c_2}/n$ for the first probability
and $|\Gamma_{D_2,c_1}(v)|\,|\Gamma_{D_2,c_2}(w)|\,B_{c_1 c_2}/n$ for the second probability. Choosing $c_1, c_2$ such that
$B_{c_1 c_2} > 0$, this expectation is $\Omega((n^{1/2+\eta})^2/n) = \Omega(n^{2\eta})$. It follows that at least
one edge is present with probability $1 - \exp(-\Omega(n^{2\eta})) = 1 - o(1)$. If such an edge
is present, then $d(v,w) \le 2D_1 + 1$ for the first probability and $d(v,w) \le 2D_2 + 1$ for the
second probability. So, the probability that the second event in the above equation
holds but not the first is $o(1)$. Thus, the last equation implies that

$$P(d(v,w) \le 2D_1 + 1) \ge (1-\gamma)^2 - o(1) \ge 1 - 2\gamma - o(1),$$
$$P(d(v,w) \le 2D_2 + 1) \ge (1-\gamma)^2 - o(1) \ge 1 - 2\gamma - o(1),$$




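where $\gamma > 0$ is arbitrary. Choosing $\eta$ small enough, we have $2D + 1 \le (1 + \epsilon)\log(n)/\log\lambda$.
As $\gamma$ is arbitrary, we have

$$P(d(v,w) \le (1+\epsilon)\tau_1) \ge 1 - \exp(-\Omega(n^{2\eta})),$$

$$P(d(v,w) \le (1+\epsilon)\tau_2) \ge 1 - \exp(-\Omega(n^{2\eta})),$$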

and the lemma follows. 

The equations (6) and (8) control the asymptotic bounds for the graph distance
$d_G(v, w)$ between two vertices $v$ and $w$ in $V(G_n)$. Under the condition (A3) it follows
that > Ai. If we consider A| = cAi, where, c is a constant, then the equations ( 6 ) 
and ( 8 ) can be written in the form of quadratic equations. So, the solutions Ti and T 2 
exist under the condition and c^- are of the order (9 (n) and the resulting solutions 
$\tau_1$ and $\tau_2$ are both of the order $O(\log n)$. Also, from the expression of the solutions
$\tau_1$ and $\tau_2$, the limits of $\tau_1/\log n$ and $\tau_2/\log n$ exist and we shall define the limits as $\sigma_1$ and $\sigma_2$
respectively.


4.3 Proof of Theorem 2 and Theorem 3 

4.3.1 Proof of Theorem 2 

We prove the limiting behavior of the typical graph distance in the giant
component as $n \to \infty$. The Theorem essentially follows from Lemmas 3 and 4. Under
the conditions mentioned in the Theorem, part (a) follows from Lemma 3(a) and 
4(a) and part (b) follows from Lemma 3(b) and 4(b). 


4.3.2 Proof of Theorem 3 

From Definition 4, we have that $D_{ij}$ = graph distance between vertices $v_i$ and $v_j$,
where $v_i, v_j \in V(G_n)$. From Lemma 3, we get for any vertices $v$ and $w$, with high
probability,

$$|\{\{v,w\} : d_G(v,w) \le (1-\epsilon)\tau_1\}| \le O(n^{2-\epsilon}), \quad \text{if type of } v = \text{type of } w,$$

$$|\{\{v,w\} : d_G(v,w) \le (1-\epsilon)\tau_2\}| \le O(n^{2-\epsilon}), \quad \text{if type of } v \ne \text{type of } w.$$

Also, from Lemma 4, we get

$$P(d_G(v,w) \le (1+\epsilon)\tau_1) = 1 - \exp(-\Omega(n^{2\eta})), \quad \text{if type of } v = \text{type of } w,$$

$$P(d_G(v,w) \le (1+\epsilon)\tau_2) = 1 - \exp(-\Omega(n^{2\eta})), \quad \text{if type of } v \ne \text{type of } w.$$


Now, $\sigma_1 = \tau_1/\log n$ and $\sigma_2 = \tau_2/\log n$ are asymptotically constant, as both $\tau_1$ and
$\tau_2$ are of the order $\log n$, as follows from equations (6) and (8). So, putting the two
statements together, we get that with high probability,

t = 0{n^-^) + 0{n^).e^ 

iJ=l'-type{vi)^type{vj) V / 


since, by Lemma 1, $\epsilon = o(1)$ and $(1 - \exp(-\Omega(n^{2\eta}))) \to 1$ as $n \to \infty$. So, putting
the two cases together, we get that with high probability, for some $\epsilon > 0$,


n 


E 




0{n^-^) + 0{rp-).e^ 




Hence, for some $\epsilon > 0$,

$$\left\| \frac{D}{\log n} - \mathbb{D} \right\|_F \le o(n).$$


We have completed proofs of Theorems 2 and 3. 


4.4 Perturbation Theory of Linear Operators 

We now establish part II of our program. $D$ can be considered as a perturbation of
the operator $\mathbb{D}$.

The Davis-Kahan Theorem [13] gives a bound on the perturbation of an eigenspace
instead of an eigenvector, as discussed previously.

Theorem 6 (Davis-Kahan (1970) [13]). Let $H, H' \in \mathbb{R}^{n \times n}$ be symmetric, suppose
$\mathcal{V} \subset \mathbb{R}$ is an interval, and suppose for some positive integer $d$ that $W, W' \in \mathbb{R}^{n \times d}$
are such that the columns of $W$ form an orthonormal basis for the sum of the
eigenspaces of $H$ associated with the eigenvalues of $H$ in $\mathcal{V}$ and that the columns of
$W'$ form an orthonormal basis for the sum of the eigenspaces of $H'$ associated with
the eigenvalues of $H'$ in $\mathcal{V}$. Let $\delta$ be the minimum distance between any eigenvalue
of $H$ in $\mathcal{V}$ and any eigenvalue of $H$ not in $\mathcal{V}$. Then there exists an orthogonal matrix
$R \in \mathbb{R}^{d \times d}$ such that $\|WR - W'\|_F \le \sqrt{2}\,\|H - H'\|_F / \delta$.
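
The bound can be checked numerically on a small example. The sketch below is an illustration under our own assumptions, not part of the proof: `H` is a two-block matrix of the kind used for $\mathbb{D}$ later in this section (hypothetical values `sigma1`, `sigma2`, block size `m`), `H_prime` adds a small symmetric Gaussian perturbation, and the rotation `R` is obtained from the orthogonal Procrustes solution. The constant $\sqrt{2}$ follows the common Frobenius-norm form of the theorem.

```python
import numpy as np

def top_eigenspace(H, k):
    """Eigenvalues sorted by absolute value, and a basis for the top-k eigenspace."""
    vals, vecs = np.linalg.eigh(H)
    order = np.argsort(-np.abs(vals))
    return vals[order], vecs[:, order[:k]]

def best_rotation(W, W_prime):
    """Orthogonal R minimizing ||W R - W'||_F (orthogonal Procrustes via SVD)."""
    U, _, Vt = np.linalg.svd(W.T @ W_prime)
    return U @ Vt

rng = np.random.default_rng(1)
K, m = 2, 25                     # hypothetical: 2 blocks of 25 nodes each
sigma1, sigma2 = 2.0, 1.5        # hypothetical within/between block values

labels = np.repeat(np.arange(K), m)
H = np.where(labels[:, None] == labels[None, :], sigma1, sigma2)
np.fill_diagonal(H, 0.0)

# Small symmetric perturbation (an assumption made for this illustration).
G = rng.normal(size=H.shape)
E = 0.05 * (G + G.T) / 2.0
H_prime = H + E

vals, W = top_eigenspace(H, K)
_, W_prime = top_eigenspace(H_prime, K)

# Gap between the K retained eigenvalues of H and the remaining ones.
delta = np.min(np.abs(vals[:K, None] - vals[None, K:]))

R = best_rotation(W, W_prime)
lhs = np.linalg.norm(W @ R - W_prime, "fro")
rhs = np.sqrt(2.0) * np.linalg.norm(E, "fro") / delta
print(lhs, rhs)
```

In this setting the left-hand side is typically far below the bound, since only the part of the perturbation that maps the top eigenspace into its complement contributes to the rotation.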
The behavior of the eigenvalues of the limiting operator $\mathbb{D}$ can be stated as follows.


Lemma 5. Under our model, the eigenvalues of $\mathbb{D}$, $|\mu_1(\mathbb{D})| \ge |\mu_2(\mathbb{D})| \ge \dots \ge |\mu_n(\mathbb{D})|$,
can be bounded as follows:

$$\mu_1(\mathbb{D}) = O(n\sigma_1), \quad |\mu_K(\mathbb{D})| = O(n(\sigma_1 - \sigma_2)), \quad \mu_{K+1}(\mathbb{D}) = \dots = \mu_n(\mathbb{D}) = -\sigma_1. \qquad (19)$$

Also, with high probability it holds that $|\mu_K(D/\log n)| = O(n(\sigma_1 - \sigma_2))$ and
$|\mu_{K+1}(D/\log n)| \le o(n)$.

Proof. The matrix $\mathbb{D} + \sigma_1 I_{n \times n}$ is a block matrix with blocks of sizes $\{n_a\}_{a=1}^{K}$ with
$\sum_{a=1}^{K} n_a = n$. The elements of the $(a,b)$-th block are all the same and equal to $\sigma_1$ if $a = b$
and equal to $\sigma_2$ if $a \ne b$. Note, the diagonal of $\mathbb{D}$ is zero, as the diagonal of $D$ is also zero.
Now, we have the eigenvalues of the K x K matrix of the values in D to be (ffi + 

{K — 1 )(72, Cl ~ C 2 ,..., (Ji — 02 ). If we consider, = cAi, then, if c > 1, we will 
have $\sigma_1 > \sigma_2$. So, under our model, we have that $\sigma_1 > \sigma_2$. So, because of repetitions
in the block matrix, $\mu_1(\mathbb{D}) = O(n\sigma_1) = O(n)$ and $\mu_K(\mathbb{D}) = O(n(\sigma_1 - \sigma_2)) = O(n)$,
since, by assumption (A3), $n_a = O(n)$ for all $a = 1, \dots, K$. Now, the rest of the
eigenvalues of $\mathbb{D} + \sigma_1 I_{n \times n}$ are zero, so the rest of the eigenvalues of $\mathbb{D}$ are $-\sigma_1$.
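
The spectrum described in this part of the proof can be checked directly in the special case of equal block sizes. The sketch below is a sanity check under that assumption (block size `m`, and hypothetical values for `sigma1` and `sigma2`), where the eigenvalues are available in closed form: one eigenvalue $m(\sigma_1 + (K-1)\sigma_2) - \sigma_1$, then $K - 1$ copies of $m(\sigma_1 - \sigma_2) - \sigma_1$, and $n - K$ copies of $-\sigma_1$.

```python
import numpy as np

# Hypothetical values for illustration; equal block sizes are an assumption.
K, m = 3, 20
sigma1, sigma2 = 2.0, 1.5
n = K * m

labels = np.repeat(np.arange(K), m)
same_block = labels[:, None] == labels[None, :]

# Limiting matrix: sigma1 within blocks, sigma2 between blocks, zero diagonal.
D_lim = np.where(same_block, sigma1, sigma2)
np.fill_diagonal(D_lim, 0.0)

eigvals = np.sort(np.linalg.eigvalsh(D_lim))[::-1]

# Closed-form spectrum for equal block sizes.
expected = np.sort(np.concatenate([
    [m * (sigma1 + (K - 1) * sigma2) - sigma1],        # leading eigenvalue, order n * sigma1
    np.full(K - 1, m * (sigma1 - sigma2) - sigma1),    # order n * (sigma1 - sigma2)
    np.full(n - K, -sigma1),                           # remaining n - K eigenvalues
]))[::-1]

print(eigvals[: K + 1])
```

The first $K$ eigenvalues grow linearly in $n$ while the rest stay fixed at $-\sigma_1$, which is the separation the perturbation argument relies on.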

Now, for the second part of the Lemma: by Weyl's inequality, for all $i = 1, \dots, n$,

$$\big|\, |\mu_i(D/\log n)| - |\mu_i(\mathbb{D})| \,\big| \le \|D/\log n - \mathbb{D}\|_F \le o(n).$$

Since, from (A1)-(A3), it follows that $\sigma_1 - \sigma_2 > c > 0$ for some constant $c$, we have
$|\mu_K(D/\log n)| = O(n(\sigma_1 - \sigma_2)) - o(n) = O(n(\sigma_1 - \sigma_2))$ for large $n$ and
$|\mu_{K+1}(D/\log n)| \le \sigma_1 + o(n) = o(n)$.

Now, let $W$ be the eigenspace corresponding to the top $K$ absolute eigenvalues of
$\mathbb{D}$ and $\hat{W}$ be the eigenspace corresponding to the top $K$ absolute eigenvalues of $D$.
Using Davis-Kahan, we obtain the following.

Lemma 6. With high probability, there exists an orthogonal matrix $R \in \mathbb{R}^{K \times K}$ such
that $\|WR - \hat{W}\|_F \le o((\sigma_1 - \sigma_2)^{-1})$.

Proof. The top $K$ eigenvalues of both $\mathbb{D}$ and $D/\log n$ lie in $(Cn, \infty)$ for some $C > 0$.
Also, the gap between the top $K$ and the $(K+1)$-th eigenvalues of the matrix $\mathbb{D}$ is $\delta = O(n(\sigma_1 - \sigma_2))$.
So now we can apply the Davis-Kahan Theorem 6 and Theorem 3 to get that
$$\|WR - \hat{W}\|_F \le \frac{\sqrt{2}\,\|D/\log n - \mathbb{D}\|_F}{\delta} \le \frac{o(n)}{O(n(\sigma_1 - \sigma_2))} = o((\sigma_1 - \sigma_2)^{-1}).$$

Now, the relationship between the rows of $W$ can be specified as follows.

Lemma 7. For any two rows $i, j$ of the $n \times K$ matrix $W$, $\|u_i - u_j\|_2 \ge O(1/\sqrt{n})$, if type of
$v_i \ne$ type of $v_j$.


Proof. The matrix $\mathbb{D} + \sigma_1 I_{n \times n}$ is a block matrix with blocks of sizes $\{n_a\}_{a=1}^{K}$ with
$\sum_{a=1}^{K} n_a = n$. The elements of the $(a,b)$-th block are all the same and equal to $\sigma_1$ if $a = b$
and equal to $\sigma_2$ if $a \ne b$. Note, the diagonal of $\mathbb{D}$ is zero, as the diagonal of $D$ is also zero.
Now, the rows of the eigenvectors of the $K \times K$ matrix of the values in $\mathbb{D}$
have a constant difference. Under our model, we have that $\sigma_1 > \sigma_2$. So, because of
repetitions in the block matrix, the rows of $\mathbb{D}$ as well as of the projection of $\mathbb{D}$ onto its
top $K$ eigenspace have differences of order $1/\sqrt{n}$ between rows of the matrix.
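
The row separation can be made concrete in the equal-block-size case, which we assume here only for illustration (block size `m`, hypothetical `sigma1`, `sigma2`). In that case the top-$K$ eigenvector matrix has identical rows within a block, and rows from different blocks are at distance exactly $\sqrt{2/m} = \sqrt{2K/n}$, which is of order $1/\sqrt{n}$.

```python
import numpy as np

# Hypothetical values for illustration; equal block sizes are an assumption.
K, m = 3, 20
sigma1, sigma2 = 2.0, 1.5

labels = np.repeat(np.arange(K), m)
same_block = labels[:, None] == labels[None, :]

D_lim = np.where(same_block, sigma1, sigma2)
np.fill_diagonal(D_lim, 0.0)

# Basis for the eigenspace of the K eigenvalues of largest absolute value.
vals, vecs = np.linalg.eigh(D_lim)
order = np.argsort(-np.abs(vals))[:K]
W = vecs[:, order]

# Pairwise Euclidean distances between rows of W. These are invariant under
# any orthogonal rotation of W, so the choice of basis does not matter.
row_dist = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)

within = row_dist[same_block].max()
between = row_dist[~same_block]
print(within, between.min(), np.sqrt(2.0 / m))
```

Because the distances are rotation invariant, the same separation applies to $WR$ for the orthogonal matrix $R$ of Lemma 6.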


Now, if we consider the $K$-means criterion as the clustering criterion on $\hat{W}$, then
the $K$-means minimizer centroid matrix $C$ is an $n \times K$ matrix with $K$ distinct rows
corresponding to the $K$ centroids of the $K$-means algorithm. By the property of the $K$-means
objective function and Lemma 6, with high probability,

$$\|C - \hat{W}\|_F \le \|WR - \hat{W}\|_F,$$
$$\|C - WR\|_F \le \|C - \hat{W}\|_F + \|WR - \hat{W}\|_F,$$
$$\|C - WR\|_F^2 \le 4\,\|WR - \hat{W}\|_F^2 \le o((\sigma_1 - \sigma_2)^{-2}).$$


By Lemma 7, for large $n$, we can get a constant $C$ such that $K$ balls, $B_1, \dots, B_K$,
of radius $r = C n^{-1/2}$ around the $K$ distinct rows of $W$ are disjoint.

Now note that with high probability the number of rows $i$ such that $\|C_i - (WR)_i\| > r$
is at most $\frac{cn}{(\sigma_1 - \sigma_2)^2}$ with arbitrarily small constant $c > 0$. If the statement
does not hold then

$$\|C - WR\|_F^2 > r^2 \cdot \frac{cn}{(\sigma_1 - \sigma_2)^2} = \frac{C^2 c}{(\sigma_1 - \sigma_2)^2}.$$

So, we get a contradiction, since $\|C - WR\|_F^2 \le o((\sigma_1 - \sigma_2)^{-2})$. Thus, the number
of mistakes should be at most $\frac{cn}{(\sigma_1 - \sigma_2)^2}$ with arbitrarily small constant $c > 0$.
So, for each $v_i \in V(G_n)$, if $c(v_i)$ is the type of $v_i$ and $\hat{c}(v_i)$ is the type of $v_i$ as
estimated from applying $K$-means on the top $K$ eigenspace of the geodesic matrix $D$, we
get that for an arbitrarily small constant $c > 0$,


$$\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left(\hat{c}(v_i) \ne c(v_i)\right) < \frac{c}{(\sigma_1 - \sigma_2)^2}.$$

So, for constant $\sigma_1$ and $\sigma_2$, we get $c > 0$ such that


-El (c(v'i) -h ^(v/)) < \ 


1 
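
The procedure analyzed in this section (geodesic matrix, top $K$ eigenspace by absolute eigenvalue, then $K$-means) can be sketched end to end as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the graph is a small, fairly dense two-block SBM with hypothetical parameters `p_in` and `p_out`, shortest paths are computed by breadth-first search, unreachable pairs (if any) are capped at one more than the largest finite distance, and $K$-means is a plain Lloyd iteration with a farthest-point initialization.

```python
import numpy as np
from collections import deque

def sample_sbm(labels, p_in, p_out, rng):
    """Symmetric boolean adjacency matrix of a stochastic block model."""
    n = len(labels)
    same = labels[:, None] == labels[None, :]
    upper = np.triu(rng.random((n, n)) < np.where(same, p_in, p_out), k=1)
    return upper | upper.T

def geodesic_matrix(adj):
    """All-pairs graph distances by BFS from every vertex."""
    n = adj.shape[0]
    neighbors = [np.flatnonzero(adj[i]) for i in range(n)]
    dist = np.full((n, n), np.inf)
    for s in range(n):
        dist[s, s] = 0.0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if np.isinf(dist[s, v]):
                    dist[s, v] = dist[s, u] + 1.0
                    queue.append(v)
    # Assumption: cap unreachable pairs instead of restricting to the giant component.
    dist[np.isinf(dist)] = dist[np.isfinite(dist)].max() + 1.0
    return dist

def top_k_eigvecs(mat, k):
    """Eigenvectors for the k eigenvalues of largest absolute value."""
    vals, vecs = np.linalg.eigh(mat)
    return vecs[:, np.argsort(-np.abs(vals))[:k]]

def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = points[assign == j].mean(axis=0)
    return assign

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 40)                  # hypothetical: 2 blocks of 40 nodes
adj = sample_sbm(labels, p_in=0.5, p_out=0.05, rng=rng)

D = geodesic_matrix(adj)
W_hat = top_k_eigvecs(D / np.log(len(labels)), 2)
pred = kmeans(W_hat, 2)

agreement = np.mean(pred == labels)
accuracy = max(agreement, 1.0 - agreement)      # labels are identified up to permutation
print(accuracy)
```

Dividing by $\log n$ does not change the eigenvectors; it is kept only to mirror the normalization $D/\log n$ used in the analysis. The dense parameter choice makes the example easy, whereas the regime of interest in the paper is sparse.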


5 Conclusion 

We have given an overview of spectral clustering in the context of community detection
in networks and clustering. We have also introduced a new method of community
detection in the paper and we have shown bounds on the theoretical performance
of the method.